[TLDR] Who Reasons in the Large Language Models?
Probing Reasoning Localization in Transformer Architectures with Diagnostic Tools
Authors: J. Shao et al.
Published on arXiv: 2025-05-27
Link: http://arxiv.org/abs/2505.20993v1
Institutions: National Key Laboratory for Novel Software Technology, Nanjing University, China • School of Artificial Intelligence, Nanjing University, China
Keywords: large language models, transformer, output projection, reasoning, interpretability, fine-tuning, module analysis, diagnostics, mathematical reasoning, Qwen, DeepSeek-R1, attention mechanism, stethoscope for networks, efficient training, modular models
Large Language Models (LLMs) excel at tasks such as reasoning and dialogue, but the specific internal mechanisms that enable their reasoning capabilities after fine-tuning remain poorly understood. There is ongoing debate over whether reasoning arises from specific modules within the Transformer architecture or is instead a distributed, incidental phenomenon.
To address these questions, the authors hypothesize mechanisms underlying reasoning and introduce a novel analytical approach with the following contributions:
- Hypothesize that the output projection module (‘o_proj’) in the Transformer’s multi-head self-attention is primarily responsible for reasoning in LLMs.
- Present the Stethoscope for Networks (SfN) suite: a set of diagnostic tools (Delta, Merge, Freeze, and Destruction Stethoscopes) for probing and manipulating individual Transformer modules, enabling module-level analysis of reasoning versus other capabilities (a minimal Delta Stethoscope sketch follows this list).
- Use open-source LLMs such as Qwen2.5 and DeepSeek-R1-Distill as experimental platforms, and evaluate results with both quantitative and qualitative methods.
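To make the Delta Stethoscope idea concrete, here is a minimal sketch (not the authors' code) of measuring per-module weight drift between a base checkpoint and a reasoning-tuned one. It assumes Hugging Face `transformers` and Qwen2.5-style module names (`q_proj`/`k_proj`/`v_proj`/`o_proj`, `gate_proj`/`up_proj`/`down_proj`); the model identifiers are illustrative and may not be the exact pair used in the paper.

```python
# Sketch of a Delta-Stethoscope-style probe: relative L2 weight drift per
# module type between a base and a reasoning-fine-tuned checkpoint.
# Model names are illustrative, not necessarily the paper's exact pair.
from collections import defaultdict

import torch
from transformers import AutoModelForCausalLM

BASE = "Qwen/Qwen2.5-1.5B"
TUNED = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float32)
tuned = AutoModelForCausalLM.from_pretrained(TUNED, torch_dtype=torch.float32)

drift = defaultdict(list)
tuned_params = dict(tuned.named_parameters())
for name, p_base in base.named_parameters():
    p_tuned = tuned_params.get(name)
    if p_tuned is None or p_tuned.shape != p_base.shape:
        continue  # skip parameters the two checkpoints do not share
    for module in ("q_proj", "k_proj", "v_proj", "o_proj",
                   "gate_proj", "up_proj", "down_proj"):
        if f".{module}." in name:
            # Relative change, so large and small matrices stay comparable.
            drift[module].append(((p_tuned - p_base).norm() / p_base.norm()).item())

for module, vals in sorted(drift.items(), key=lambda kv: -sum(kv[1]) / len(kv[1])):
    print(f"{module:10s} mean relative delta: {sum(vals) / len(vals):.4f}")
```

If the paper's hypothesis holds, `o_proj` should top this ranking.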
Building on the proposed solution and methodology, the authors conduct in-depth experiments and present key findings:
- The ‘o_proj’ module consistently exhibits the largest weight changes after reasoning-specific fine-tuning, regardless of model scale (1.5B to 70B parameters).
- Transferring only the ‘o_proj’ weights from a strong reasoning model to a weak one yields significant gains in reasoning performance, approaching that of fully fine-tuned models (see the merge sketch after this list).
- Restricting fine-tuning to ‘o_proj’ enables efficient specialization for reasoning, while freezing ‘o_proj’ impairs reasoning but leaves conversational ability intact. Other modules show the reverse pattern, indicating a modular division of labor (a freeze/selective-tuning sketch also follows this list).
- Evaluations span benchmarks including AIME 2024, Math500, and GPQA Diamond, with comparisons against both base and fully fine-tuned models.
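The ‘o_proj’ transfer result suggests a simple merge recipe. A minimal sketch, assuming two architecturally identical Hugging Face checkpoints (model names are placeholders, not the paper's exact setup):

```python
# Sketch of a Merge-Stethoscope-style graft: copy only `o_proj` weights
# from a strong reasoning model into a matching weaker model.
import torch
from transformers import AutoModelForCausalLM

weak = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")
strong = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")

strong_sd = strong.state_dict()
with torch.no_grad():
    for name, param in weak.named_parameters():
        if (".o_proj." in name and name in strong_sd
                and strong_sd[name].shape == param.shape):
            param.copy_(strong_sd[name])  # transplant the output projection only

weak.save_pretrained("qwen2.5-7b-oproj-grafted")
```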
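Likewise, the freeze and selective-tuning experiments reduce to toggling `requires_grad` per module before building the optimizer. A minimal sketch with placeholder model name and hyperparameters:

```python
# Sketch of a Freeze-Stethoscope-style setup: train only `o_proj`
# (reasoning-focused tuning), or invert the flag to freeze `o_proj`
# and train everything else (the ablation that spares conversation).
import torch
from transformers import AutoModelForCausalLM

def set_trainable(model, train_o_proj_only: bool = True) -> None:
    for name, param in model.named_parameters():
        is_o_proj = ".o_proj." in name
        param.requires_grad = is_o_proj if train_o_proj_only else not is_o_proj

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")
set_trainable(model, train_o_proj_only=True)
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5)
```

Because only a small slice of the parameters receives gradients, this setup is consistent with the faster, more memory-efficient convergence reported for reasoning-only tuning.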
These experimental results lead to a set of important conclusions:
- Reasoning abilities in LLMs are predominantly localized in the output projection (‘o_proj’), while conversational skills rely more on other modules.
- Selective tuning of ‘o_proj’ greatly streamlines reasoning-focused training, offering faster and more memory-efficient convergence.
- The Stethoscope for Networks tools improve interpretability and support modular LLM design for domain-specific needs, though future work is needed to broaden the diagnostic toolkit and deepen its theoretical grounding.